Algorithm 5 Optimizing 1-bit CNNs with Bayesian Learning
Input: The full-precision kernels k, the reconstruction vector w, the learning rate η, regularization parameters λ, θ and variance ν, and the training dataset.
Output: The BONN with the updated k, w, μ, σ, $c_m$, $\sigma_m$.
1: Initialize k and w randomly, and then estimate μ, σ based on the average and variance of k, respectively;
2: repeat
3:    // Forward propagation
4:    for l = 1 to L do
5:       $\hat{k}^l_i = w^l \circ \mathrm{sign}(k^l_i)$, ∀i;  // Each element of $w^l$ is replaced by the average of all elements of $w^l$.
6:       Perform activation binarization;  // Using the sign function
7:       Perform 2D convolution with $\hat{k}^l_i$, ∀i;
8:    end for
9:    // Backward propagation
10:   Compute $\delta_{\hat{k}^l_i} = \partial L_S / \partial \hat{k}^l_i$, ∀l, i;
11:   for l = L to 1 do
12:      Calculate $\delta_{k^l_i}$, $\delta_{w^l}$, $\delta_{\mu^l_i}$, $\delta_{\sigma^l_i}$;  // using Eqs. 3.112–3.119
13:      Update parameters $k^l_i$, $w^l$, $\mu^l_i$, $\sigma^l_i$ using SGD;
14:   end for
15:   Update $c_m$, $\sigma_m$;
16: until convergence
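To make steps 5–7 of Algorithm 5 concrete, the following minimal PyTorch sketch collapses the reconstruction vector to its layer-wise mean, binarizes kernels and activations with the sign function, and runs a standard 2D convolution. The SignSTE helper, the bonn_forward name, and all tensor shapes are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F


class SignSTE(torch.autograd.Function):
    """sign(x) in the forward pass; straight-through gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Pass the gradient only where |x| <= 1 (a common straight-through choice).
        return grad_output * (x.abs() <= 1).float()


def bonn_forward(k, w, o):
    """k: full-precision kernels (C_out, C_in, kH, kW); w: reconstruction vector of the layer;
    o: input feature map (N, C_in, H, W)."""
    w_bar = w.mean()                           # step 5: every element of w^l is replaced by its mean
    k_hat = w_bar * SignSTE.apply(k)           # k_hat^l_i = w_bar^l * sign(k^l_i)
    o_hat = SignSTE.apply(o)                   # step 6: binarize the activations
    return F.conv2d(o_hat, k_hat, padding=1)   # step 7: 2D convolution with the binarized operands


k = torch.randn(16, 8, 3, 3, requires_grad=True)
w = torch.rand(16, requires_grad=True)         # assumed length of w^l
o = torch.randn(4, 8, 32, 32)
print(bonn_forward(k, w, o).shape)             # torch.Size([4, 16, 32, 32])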
where w denotes a learned vector used to reconstruct the full-precision kernels and is shared within a layer. As mentioned in Section 3.2, during forward propagation, $w^l$ becomes a scalar $\bar{w}^l$ in each layer, where $\bar{w}^l$ is the mean of $w^l$ and is calculated online. The convolution process is represented as
$$O^{l+1} = \big((\bar{w}^l)^{-1} \hat{K}^l\big) * \hat{O}^l = (\bar{w}^l)^{-1} \big(\hat{K}^l * \hat{O}^l\big), \quad (3.111)$$
where $\hat{O}^l$ denotes the binarized feature map of the l-th layer, and $O^{l+1}$ is the feature map of the (l+1)-th layer. As Eq. 3.111 depicts, the actual convolution is still binary, and $O^{l+1}$ is obtained by simply multiplying $(\bar{w}^l)^{-1}$ with the result of the binary convolution. For each layer, only one floating-point multiplication is added, which is negligible for BONNs.
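Because $(\bar{w}^l)^{-1}$ is a single scalar during forward propagation, it can be pulled out of the convolution, which is exactly what Eq. 3.111 states. The short check below (a sketch with randomly generated binary tensors, not library code) confirms the equivalence numerically:

import torch
import torch.nn.functional as F

w_bar = torch.rand(1) + 0.1                    # scalar stand-in for the layer-wise mean of w^l
k_hat = torch.sign(torch.randn(16, 8, 3, 3))   # binarized kernels
o_hat = torch.sign(torch.randn(4, 8, 32, 32))  # binarized feature map

# Left-hand side of Eq. 3.111: scale the binarized kernels, then convolve.
lhs = F.conv2d(o_hat, (1.0 / w_bar) * k_hat, padding=1)
# Right-hand side: convolve with binary operands, then apply one scalar multiplication.
rhs = (1.0 / w_bar) * F.conv2d(o_hat, k_hat, padding=1)

print(torch.allclose(lhs, rhs, atol=1e-4))     # True: only one extra floating-point multiply per layer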
In addition, we consider the Gaussian distribution in the forward process of Bayesian pruning, which updates every filter in a group based on the mean of that group. Specifically, during pruning we replace each filter by $K^l_{i,j} \leftarrow (1-\gamma)K^l_{i,j} + \gamma \bar{K}^l_j$, where $\bar{K}^l_j$ denotes the mean filter of the group.
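A minimal sketch of this group-mean update during pruning, assuming a simple fixed assignment of output filters to groups and an illustrative value of γ (the helper name group_mean_update and the grouping scheme are ours, not part of the method):

import torch


def group_mean_update(kernels, group_ids, gamma=0.5):
    """kernels: (C_out, C_in, kH, kW); group_ids: length-C_out tensor of group indices."""
    updated = kernels.clone()
    for g in torch.unique(group_ids):
        idx = (group_ids == g).nonzero(as_tuple=True)[0]
        group_mean = kernels[idx].mean(dim=0, keepdim=True)     # mean filter of group g
        # K_{i,j} <- (1 - gamma) * K_{i,j} + gamma * mean of group j
        updated[idx] = (1.0 - gamma) * kernels[idx] + gamma * group_mean
    return updated


kernels = torch.randn(8, 4, 3, 3)
group_ids = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])              # two groups of four filters
pruned = group_mean_update(kernels, group_ids, gamma=0.5)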
3.7.6 Asynchronous Backward Propagation
To minimize Eq. 3.108, we update $k^{l,i}_n$, $w^l$, $\mu^l_i$, $\sigma^l_i$, $c_m$, and $\sigma_m$ using stochastic gradient descent (SGD) in an asynchronous manner, which updates the vector $w$ instead of its mean $\bar{w}$, as elaborated below.
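The asynchrony can be seen in a few lines: the forward pass consumes only the scalar mean of $w^l$, yet the SGD step updates every element of the vector $w^l$. A toy illustration under a placeholder loss (the shapes and the loss itself are assumptions made purely for this sketch):

import torch

w = torch.rand(8, requires_grad=True)             # reconstruction vector w^l (assumed length)
k = torch.randn(16, 8, 3, 3, requires_grad=True)  # full-precision kernels
optimizer = torch.optim.SGD([k, w], lr=0.01)

# Forward propagation uses only the scalar mean of w (Algorithm 5, step 5) ...
k_hat = w.mean() * torch.sign(k).detach()         # detach: sign has zero gradient almost everywhere
loss = k_hat.pow(2).mean()                        # placeholder for the true objective of Eq. 3.108
loss.backward()

# ... but the update step modifies the whole vector w, not its mean:
optimizer.step()
print(w.grad.shape)                               # torch.Size([8]) -- every entry of w receives a gradient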
Updating $k^{l,i}_n$: We define $\delta_{k^{l,i}_n}$ as the gradient of the full-precision kernel $k^{l,i}_n$, and we have
$$\delta_{k^{l,i}_n} = \frac{\partial L}{\partial k^{l,i}_n} = \frac{\partial L_S}{\partial k^{l,i}_n} + \frac{\partial L_B}{\partial k^{l,i}_n}. \quad (3.112)$$
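Eq. 3.112 simply splits the kernel gradient into the contribution of $L_S$ and that of $L_B$. The sketch below verifies the decomposition with autograd, using a placeholder task loss and a quadratic stand-in for the Bayesian term; the true forms follow from Eq. 3.108 and the subsequent equations, not from this example.

import torch

k = torch.randn(16, 8, 3, 3, requires_grad=True)
mu = torch.zeros_like(k)                    # stand-in for the prior mean
lam = 1e-3                                  # stand-in weighting

L_S = k.sigmoid().mean()                    # placeholder task loss
L_B = lam * (k - mu).pow(2).sum()           # placeholder Bayesian regularizer

grad_S = torch.autograd.grad(L_S, k)[0]     # dL_S / dk
grad_B = torch.autograd.grad(L_B, k)[0]     # dL_B / dk
delta_k = grad_S + grad_B                   # Eq. 3.112: the two contributions simply add

# Differentiating the total loss directly gives the same gradient.
k2 = k.detach().clone().requires_grad_(True)
total = k2.sigmoid().mean() + lam * (k2 - mu).pow(2).sum()
total.backward()
print(torch.allclose(delta_k, k2.grad))     # True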